DETAILED ACTION

1. The text of those sections of Title 35, U.S. Code not included in this action can
be found in a prior Office action. 

Information Disclosure Statement 

2. The examiner unintentionally failed to initial one of the documents on the
applicant's IDS. The document has been considered and a completed 1449 Form has 
been resubmitted to the applicant. 

Response to Amendment 

3. This communication is in response to applicant's amendment dated September 4, 2007.
The applicant amended claims 1, 4, 5, 8, 10, 15, 17, 18, 20, 23, and 25, and
cancelled claims 6, 11, and 28. The applicant's remarks additionally state that claim 7
was cancelled (see p. 9); however, claim 7 remains in the claim amendments and is
therefore still considered part of the applicant's claims. Claims 1-5, 7-10, and 12-27
are currently pending in this application.

Response to Arguments 

4. Applicant's arguments with respect to claims 1-5, 7-10, and 12-27 have been 
considered but are moot in view of the new ground(s) of rejection.

Claim Rejections - 35 USC § 112

5. The following is a quotation of the first paragraph of 35 U.S.C. 112: 

The specification shall contain a written description of the invention, and of the manner and process of 
making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the 
art to which it pertains, or with which it is most nearly connected, to make and use the same and shall 
set forth the best mode contemplated by the inventor of carrying out his invention. 

6. Claims 1, 8, 15, and 20 are rejected under 35 U.S.C. 112, first paragraph, as
failing to comply with the written description requirement. The claim(s) contains subject 
matter which was not described in the specification in such a way as to reasonably 
convey to one skilled in the relevant art that the inventor(s), at the time the application 
was filed, had possession of the claimed invention. 

Claims 1, 8, 15, and 20 contain the new limitation of identifying changes in 
frequencies. The original specification does not support this limitation. The specification 
merely identifies bands of frequencies. It is unclear what changes in frequency are 
being detected and/or matched. 

Claim 20 contains the additional limitation of separating audio and video into 
separate files. The original specification does not support this new limitation. 

Claim Rejections - 35 USC § 103 

7. Claims 1-5, 8-10, and 12-27 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Katsumi Patent No.: US 6,369,846 ("KATSUMI") in view of Nefian 
Pub. No.: US 2003/0212557 ("NEFIAN").

8. Regarding claim 1, KATSUMI teaches a method, comprising:

electronically capturing visual features associated with a speaker speaking
("speaking attendee determination information based on video signal", KATSUMI, 
column 6, lines 52-53); 

electronically capturing audio ("speaking attendee determination information 
based on audio signal", KATSUMI, column 6, lines 50-51); 

matching selective portions of the audio with the visual features ("when the 
'speaking attendee determination information based on audio signal' represents a voice 
and the 'speaking attendee determination information based on video signal' represents 
a change of the shape of the lip portion simultaneously", KATSUMI, column 6, lines
58-63); and

identifying the remaining and unmatched portions of the audio as potential noise 
not associated with the speaker speaking ("if an audio signal contains a noise such as a 
page turning noise and voices of other people along with a voice of a conference 
attendee, since an image of the motion of the lip portion of a conference attendee can 
be detected from a video signal, the speaking attendee can be determined", KATSUMI, 
column 7, lines 63-67). 
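
For illustration only, the matching and noise-identification steps recited above can be
sketched as follows. This sketch is not taken from KATSUMI, NEFIAN, or the application;
the names TimeSlice, voice_detected, lip_moving, and label_slices are hypothetical, and
it assumes that the audio and the video have already been reduced to one boolean flag
each per time slice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimeSlice:
    """One time slice of a recording (hypothetical structure)."""
    index: int
    voice_detected: bool  # assumed audio-side flag: a voice is present
    lip_moving: bool      # assumed video-side flag: the lip shape changed


def label_slices(slices: List[TimeSlice]) -> List[str]:
    """Label each slice by comparing the audio flag with the video flag."""
    labels = []
    for s in slices:
        if s.voice_detected and s.lip_moving:
            # Audio and lip movement occur in the same slice: matched.
            labels.append("speaker")
        elif s.voice_detected:
            # Audio without lip movement: remaining, unmatched audio.
            labels.append("potential_noise")
        else:
            # No voice detected in this slice.
            labels.append("silence")
    return labels


example = [
    TimeSlice(0, voice_detected=True, lip_moving=True),
    TimeSlice(1, voice_detected=True, lip_moving=False),
    TimeSlice(2, voice_detected=False, lip_moving=True),
]
print(label_slices(example))
```

In this sketch, only audio that coincides with lip movement is attributed to the
speaker; audio in any other slice is treated as potential noise.
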

However, KATSUMI does not disclose that the method occurs during a training 
session; that the visual features include a face recognition of the speaker and a mouth 
recognition within pixels associated with the face that detects when the mouth is moving 
and when the mouth is not moving by differences in the pixels from frame to frame in 
the captured visual features during the training session; and detecting changes in 
frequencies in the captured audio during same time slices of the training session for the 
captured visual features when the mouth is detected as moving and when the mouth is
detected as not moving. 

In the same field of audiovisual processing, NEFIAN teaches: 

a training session ("training network and speech recognition module 18", 
NEFIAN, paragraph [0012]); 

visual features including a face recognition of the speaker and a mouth 
recognition within pixels associated with the face that detects when the mouth is moving 
and when the mouth is not moving by differences in the pixels from frame to frame in 
the captured visual features during the training session (see NEFIAN, paragraphs 
[0018]-[0020], a series of vector calculations are performed on the pixels representing 
the mouth regions); and 

detecting changes in frequencies in the captured audio ("13 MFCC coefficients 
extracted from a window of 20 ms", NEFIAN, paragraph [0055]) during same time slices 
of the training session for the captured visual features when the mouth is detected as 
moving and when the mouth is detected as not moving ("discrete nodes at time t for 
each HMM are conditioned by the discrete nodes at time t1 of all the related HMMs", 
NEFIAN, paragraph [0023]). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use the audiovisual matching method of NEFIAN 
with the speaker determination system of KATSUMI in order to "improve the 
performance of speech recognition" (NEFIAN, paragraph [0003]).

9. Regarding claim 2, KATSUMI further teaches:

electronically capturing additional visual features associated with a different 
speaker speaking ("images of terminals determined as having speaking attendees", 
KATSUMI, column 7, lines 56-57); and 

matching some of the remaining portions of the audio from the potential noise 
with the additional speaker speaking ("if an audio signal contains a noise such as a 
page turning noise and voices of other people along with a voice of a conference 
attendee, since an image of the motion of the lip portion of a conference attendee can 
be detected from a video signal, the speaking attendee can be determined", KATSUMI, 
column 7, lines 63-67). 

10. Regarding claim 3, NEFIAN further teaches generating parameters associated 
with the matching and the identifying ("audio processing and visual feature extraction", 
NEFIAN, paragraph [0012]) and providing the parameters to a Bayesian Network which 
models the speaker speaking ("video data must be fused with audio data using... a 
coupled hidden Markov model [HMM]", NEFIAN, paragraph [0023], where the HMM is a 
dynamic Bayesian network). 

11. Regarding claim 4, NEFIAN further teaches that electronically capturing the
visual features further includes processing a neural network ("neural network", NEFIAN, 
paragraph [0014]) against electronic video associated with the speaker speaking 
("speaker's face in a video sequence", NEFIAN, paragraph [0014]), wherein the neural 
network is trained to detect and monitor the face of the speaker ("face detection",
NEFIAN, paragraph [0014]). 

12. Regarding claim 5, NEFIAN further teaches filtering the detected face of the 
speaker to detect movement or lack of movement in the mouth of the speaker ("after the 
face is detected, mouth region discrimination is usual", NEFIAN, paragraph [0015]). 

13. Regarding claim 8, KATSUMI teaches a method, comprising: 

monitoring an electronic video of a first speaker and a second speaker ("images 
of terminals determined as having speaking attendees", KATSUMI, column 7, lines 56-57);

concurrently capturing audio associated with the first and second speaker 
speaking ("voices may be contained in the audio signal", KATSUMI, column 3, line 1); 

analyzing the video to detect when the first and second speakers are moving 
their respective mouths ("extracts the change amount of the shape of the lip portion", 
KATSUMI, column 6, lines 24-25); and 

matching portions of the captured audio to the first speaker and other portions to 
the second speaker based on the analysis and wherein at least some points in the 
training session indicate that the first and second speakers are simultaneously speaking 
("if an audio signal contains a noise such as a page turning noise and voices of other 
people along with a voice of a conference attendee, since an image of the motion of the 
lip portion of a conference attendee can be detected from a video signal, the speaking
attendee can be determined", KATSUMI, column 7, lines 63-67). 

However, KATSUMI does not disclose a training session for face recognition of 
the first and second speakers; indications as to when mouths for the first and second 
speakers are moving and not moving from frame to frame of the video during the 
training session; audio separated from video and matched back to a corresponding 
portion of the video via a particular time slice associated with both the audio and the 
video; detecting differences in pixels within the faces occurring from frame to frame of 
the video for each of the speakers; and detecting changes in frequencies within the 
audio for a same time slice that indicates a particular mouth of one of the speakers is 
moving and by noting a particular frequency for a particular one of the speakers, and 
discerning what each is saying based on their respective frequencies that were noted. 

In the same field of audiovisual processing, NEFIAN teaches: 

a training session ("training network and speech recognition module 18", 
NEFIAN, paragraph [0012]); 

indications as to when mouths for the first and second speakers are moving and 
not moving from frame to frame of the video during the training session (see NEFIAN, 
paragraphs [0018]-[0020], a series of vector calculations are performed on the pixels 
representing the mouth regions); 

audio separated from video ("audiovisual data is separately subjected to audio 
processing and visual feature extraction 14", NEFIAN, paragraph [0012]) and matched 
back to a corresponding portion of the video ("video data must be fused with audio 
data", NEFIAN, paragraph [0023]) via a particular time slice associated with both the
audio and the video ("discrete nodes at time t for each HMM are conditioned by the 
discrete nodes at time t1 of all the related HMMs", NEFIAN, paragraph [0023]); 

detecting differences in pixels within the faces occurring from frame to frame of 
the video for each of the speakers (see NEFIAN, paragraphs [0018]-[0020], a series of 
vector calculations are performed on the pixels representing the mouth regions); and 

detecting changes in frequencies within the audio ("13 MFCC coefficients 
extracted from a window of 20 ms", NEFIAN, paragraph [0055]) for a same time slice 
that indicates a particular mouth of one of the speakers is moving ("discrete nodes at 
time t for each HMM are conditioned by the discrete nodes at time t1 of all the related 
HMMs", NEFIAN, paragraph [0023]) and by noting a particular frequency for a particular 
one of the speakers ("13 MFCC coefficients extracted from a window of 20 ms", 
NEFIAN, paragraph [0055]), and discerning what each is saying based on their 
respective frequencies that were noted ("for audio-only speech recognition", NEFIAN, 
paragraph [0055]). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use the audiovisual matching method of NEFIAN 
with the speaker determination system of KATSUMI in order to "improve the 
performance of speech recognition" (NEFIAN, paragraph [0003]).

14. Regarding claim 9, NEFIAN further teaches modeling the analysis for
subsequent interactions with the first and second speakers ("the result is a model for the 
underlying process", NEFIAN, paragraph [0033]). 

15. Regarding claim 10, NEFIAN further teaches that analyzing further includes
processing a neural network ("neural network", NEFIAN, paragraph [0014]) for detecting 
the faces of the first and second speakers ("speaker's face in a video sequence", 
NEFIAN, paragraph [0014]) and processing vector classifying algorithms to detect when 
the first and second speakers' respective mouths are moving or not moving (see 
NEFIAN, paragraphs [0018]-[0020], a series of vector calculations is performed on the 
mouth regions). 

16. Regarding claim 13, KATSUMI further teaches identifying selective portions of 
the captured audio as noise if the selective portions have not been matched to the first 
speaker or the second speaker ("if an audio signal contains a noise such as a page 
turning noise and voices of other people along with a voice of a conference attendee, 
since an image of the motion of the lip portion of a conference attendee can be detected 
from a video signal, the speaking attendee can be determined", KATSUMI, column 7, 
lines 63-67). 

17. Regarding claim 14, NEFIAN further teaches that matching further includes 
identifying time dependencies associated with when selective portions of the electronic 
video were monitored and when selective portions of the audio were captured ("discrete
nodes at time t for each HMM are conditioned by the discrete nodes at time t1 of all the 
related HMMs", NEFIAN, paragraph [0023]). 

18. Regarding claim 15, KATSUMI teaches a system, comprising: 

a camera (see KATSUMI, column 4, lines 22-23, the conference terminals 
produce video signals, therefore a camera is inherent); 

a microphone (see KATSUMI, column 4, lines 22-23, the conference terminals 
produce audio signals, therefore a microphone is inherent); and 

a processing device ("MCU", KATSUMI, column 4, line 21), wherein the camera 
captures video of a speaker and communicates the video to the processing device, the 
microphone captures audio associated with the speaker and an environment of the 
speaker and communicates the audio to the processing device ("the conference 
terminals 6a to 6c multiplex video signals and audio signals of locations [A] to [C]... and 
transmit the transmission signals to the MCU", KATSUMI, column 4, lines 22-26), the 
processing device includes instructions that identifies visual features of the video where 
the speaker is speaking ("speaking attendee determination information based on video 
signal", KATSUMI, column 6, lines 52-53) and uses time dependencies to match 
portions of the audio to those visual features ("when the 'speaking attendee 
determination information based on audio signal' represents a voice and the 'speaking 
attendee determination information based on video signal' represents a change of the 
shape of the lip portion simultaneously", KATSUMI, column 6, lines 58-63).

However, KATSUMI does not disclose a plurality of frames within a period of time
designated as a training session and each frame associated with a particular time slice; 
audio associated with the particular time slice of the training session; and a processing 
device that recognizes a face of the speaker in each frame of the video and a mouth 
within the face and detects when the mouth is moving or not moving from frame to 
frame of the video by changes in pixels associated with the mouth, and wherein when 
the mouth is moving a detected change in frequency within the same time slice of the 
audio identifies the speaker as speaking and a particular frequency that uniquely 
identifies the speaker when the speaker is speaking. 

In the same field of audiovisual processing, NEFIAN teaches:

a plurality of frames ("digital form including but not limited to MPEG-2", NEFIAN,
paragraph [0011]) within a period of time designated as a training session ("training 
network and speech recognition module 18", NEFIAN, paragraph [0012]) and each 
frame associated with a particular time slice (MPEG frames are inherently associated 
with a time slice); 

audio associated with the particular time slice of the training session ("video data 
must be fused with audio data", NEFIAN, paragraph [0023]); and 

a processing device that recognizes a face of the speaker in each frame of the 
video ("speaker's face in a video sequence", NEFIAN, paragraph [0014]) and a mouth
within the face and detects when the mouth is moving or not moving from frame to 
frame of the video by changes in pixels associated with the mouth (see NEFIAN, 
paragraphs [0018]-[0020], a series of vector calculations are performed on the pixels 
representing the mouth regions), and wherein when the mouth is moving a detected
change in frequency within the same time slice of the audio identifies the speaker as 
speaking ("discrete nodes at time t for each HMM are conditioned by the discrete nodes 
at time t1 of all the related HMMs", NEFIAN, paragraph [0023]) and a particular 
frequency that uniquely identifies the speaker when the speaker is speaking ("13 MFCC 
coefficients extracted from a window of 20 ms", NEFIAN, paragraph [0055]). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use the audiovisual matching method of NEFIAN 
with the speaker determination system of KATSUMI in order to "improve the 
performance of speech recognition" (NEFIAN, paragraph [0003]). 

19. Regarding claim 16, KATSUMI further teaches that the captured video also 
includes images of a second speaker ("images of terminals determined as having 
speaking attendees", KATSUMI, column 7, lines 56-57) and the audio includes sounds 
associated with the second speaker ("voices may be contained in the audio signal", 
KATSUMI, column 3, line 1), and wherein the instructions matches some portions of the 
audio to the second speaker when some of the visual features indicate the second 
speaker is speaking ("if an audio signal contains a noise such as a page turning noise 
and voices of other people along with a voice of a conference attendee, since an image 
of the motion of the lip portion of a conference attendee can be detected from a video 
signal, the speaking attendee can be determined", KATSUMI, column 7, lines 63-67).

Regarding claim 17, NEFIAN further teaches instructions that interact with a
neural network ("neural network", NEFIAN, paragraph [0014]) to detect the face of the 
speaker from the captured video ("speaker's face in a video sequence", NEFIAN, 
paragraph [0014]). 

20. Regarding claim 18, NEFIAN further teaches that the instructions interact with a 
pixel vector algorithm to detect when the mouth associated with the face moves or does 
not move within the captured video (see NEFIAN, paragraphs [0018]-[0020], a series of 
vector calculations are performed on the pixels representing the mouth regions). 

21. Regarding claim 19, NEFIAN further teaches that the instructions generate 
parameter data ("audio processing and visual feature extraction", NEFIAN, paragraph 
[0012]) that configures a Bayesian network ("video data must be fused with audio data 
using... a coupled hidden Markov model [HMM]", NEFIAN, paragraph [0023], where the 
HMM is a dynamic Bayesian network) which models subsequent interactions with the 
speaker ("the result is a model for the underlying process", NEFIAN, paragraph [0033]) 
to determine when the speaker is speaking and to determine appropriate audio to 
associate with the speaker speaking in the subsequent interactions ("speech 
recognition", NEFIAN, paragraph [0023]). 

22. Regarding claim 20, KATSUMI teaches a machine accessible medium having 
associated instructions, which when accessed, results in a machine performing:

separating audio and video associated with a speaker speaking into separate
files for analysis (see KATSUMI, FIG. 3, the audio processing is separate from the video 
processing); 

identifying visual features from the video that indicate a mouth of the speaker is 
moving or not moving ("extracts the change amount of the shape of the lip portion", 
KATSUMI, column 6, lines 24-25); and 

associating portions of the audio with selective ones of the visual features that 
indicate the mouth is moving ("when the 'speaking attendee determination information 
based on audio signal' represents a voice and the 'speaking attendee determination 
information based on video signal' represents a change of the shape of the lip portion 
simultaneously", KATSUMI, column 6, lines 58-63). 

However, KATSUMI does not disclose: a training session, wherein each file 
associated with a same time line to permit specific frames of the video to be matched to 
specific frequencies of the audio during a same time slice occurring along the time line 
for the training session; identifying a face of the speaker and then identifying pixels 
within the face that represents the mouth and then noting changes in the pixels from 
frame to frame of the video along the time line; and matching changes in the 
frequencies of the audio with detected movements of the mouth during a same time 
period within the time line and associating a particular frequency with the speaker when 
the mouth is moving. 

In the same field of audiovisual processing, NEFIAN teaches:

a training session ("training network and speech recognition module 18",
NEFIAN, paragraph [0012]), wherein each file associated with a same time line to 
permit specific frames of the video to be matched to specific frequencies of the audio 
during a same time slice occurring along the time line for the training session ("discrete 
nodes at time t for each HMM are conditioned by the discrete nodes at time t1 of all the 
related HMMs", NEFIAN, paragraph [0023]); 

identifying a face of the speaker ("speaker's face in a video sequence", NEFIAN, 
paragraph [0014]) and then identifying pixels within the face that represents the mouth 
and then noting changes in the pixels from frame to frame of the video along the time 
line (see NEFIAN, paragraphs [0018]-[0020], a series of vector calculations are
performed on the pixels representing the mouth regions); and 

matching changes in the frequencies of the audio with detected movements of 
the mouth during a same time period within the time line ("video data must be fused with 
audio data", NEFIAN, paragraph [0023]) and associating a particular frequency with the 
speaker when the mouth is moving ("13 MFCC coefficients extracted from a window of 
20 ms", NEFIAN, paragraph [0055]). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use the audiovisual matching method of NEFIAN 
with the speaker determination system of KATSUMI in order to "improve the 
performance of speech recognition" (NEFIAN, paragraph [0003]).

23. Regarding claim 21, KATSUMI further teaches including instructions for
associating other portions of the audio with different ones of the visual features that 
indicate the mouth is not moving ("if an audio signal contains a noise such as a page 
turning noise and voices of other people along with a voice of a conference attendee, 
since an image of the motion of the lip portion of a conference attendee can be detected 
from a video signal, the speaking attendee can be determined", KATSUMI, column 7, 
lines 63-67). 

24. Regarding claim 22, KATSUMI further teaches instructions for: 

identifying second visual features from the video that indicate a different mouth of 
another speaker is moving or not moving ("images of terminals determined as having 
speaking attendees", KATSUMI, column 7, lines 56-57); and 

associating different portions of the audio with selective ones of the second 
visual features that indicate the different mouth is moving ("if an audio signal contains a 
noise such as a page turning noise and voices of other people along with a voice of a 
conference attendee, since an image of the motion of the lip portion of a conference 
attendee can be detected from a video signal, the speaking attendee can be 
determined", KATSUMI, column 7, lines 63-67).

25. Regarding claim 23, NEFIAN further teaches instructions for:

processing a neural network ("neural network", NEFIAN, paragraph [0014]) to
detect the face of the speaker ("speaker's face in a video sequence", NEFIAN, 
paragraph [0014]); and 

processing a vector matching algorithm to detect movements of the mouth of the 
speaker within the detected face (see NEFIAN, paragraphs [0018]-[0020], a series of 
vector calculations are performed on the pixels representing the mouth regions). 

26. Regarding claim 24, KATSUMI further teaches that the instructions for 
associating further include instructions for matching same time slices associated with a 
time that the portions of the audio were captured and the same time during which the 
selective ones of the visual features were captured within the video ("when the 
'speaking attendee determination information based on audio signal' represents a voice 
and the 'speaking attendee determination information based on video signal' represents 
a change of the shape of the lip portion simultaneously", KATSUMI, column 6, lines
58-63).

27. Regarding claim 25, KATSUMI teaches an apparatus, residing in a
computer-accessible medium, comprising:

face detection logic ("detects at least the lip portion of a conference attendee 
from the video signal", KATSUMI, column 6, lines 23-34); 

mouth detection logic ("extracts the change amount of the shape of the lip 
portion", KATSUMI, column 6, lines 24-25); and

audio-video matching logic, wherein the face detection logic detects a face of a
speaker within a video ("detects at least the lip portion of a conference attendee", 
KATSUMI, column 6, lines 23-34), the mouth detection logic detects and monitors 
movement and non-movement of a mouth included within the face of the video 
("extracts the change amount of the shape of the lip portion", KATSUMI, column 6, lines 
24-25), and the audio-video matching logic matches captured audio with any 
movements identified by the mouth detection logic ("when the 'speaking attendee 
determination information based on audio signal' represents a voice and the 'speaking 
attendee determination information based on video signal' represents a change of the 
shape of the lip portion simultaneously", KATSUMI, column 6, lines 58-63). 

However, KATSUMI does not disclose: specific frequencies occurring within 
captured audio during a training session and for a same time slice of that training 
session, and wherein the mouth is detected as moving by changes in pixels that 
represent the mouth within the face that occur from frame to frame of the video. 

In the same field of audiovisual processing, NEFIAN teaches: 

specific frequencies occurring within captured audio ("13 MFCC coefficients 
extracted from a window of 20 ms", NEFIAN, paragraph [0055]) during a training 
session ("training network and speech recognition module 18", NEFIAN, paragraph 
[0012]) and for a same time slice of that training session ("discrete nodes at time t for 
each HMM are conditioned by the discrete nodes at time t1 of all the related HMMs", 
NEFIAN, paragraph [0023]), and wherein the mouth is detected as moving by changes
in pixels that represent the mouth within the face that occur from frame to frame of the 
video (see NEFIAN, paragraphs [0018]-[0020], a series of vector calculations are
performed on the pixels representing the mouth regions). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use the audiovisual matching method of NEFIAN 
with the speaker determination system of KATSUMI in order to "improve the 
performance of speech recognition" (NEFIAN, paragraph [0003]). 

28. Regarding claim 26, NEFIAN further teaches that the apparatus is used to 
configure a Bayesian network which models the speaker speaking ("video data must be 
fused with audio data using... a coupled hidden Markov model [HMM]", NEFIAN, 
paragraph [0023], where the HMM is a dynamic Bayesian network). 

29. Regarding claim 27, NEFIAN further teaches that the face detection logic 
comprises a neural network ("neural network", NEFIAN, paragraph [0014]). 

30. Claims 7 and 12 are rejected under 35 U.S.C. 103(a) as being unpatentable 
over Katsumi Patent No.: US 6,369,846 ("KATSUMI") in view of Nefian Pub. No.: US 
2003/0212557 ("NEFIAN") and further in view of Van Schyndel Patent No.: US 
5,940,118 ("VAN SCHYNDEL"). 

31. Regarding claim 7, the combination of KATSUMI and NEFIAN teaches all the
limitations of claim 1.

However, NEFIAN and KATSUMI do not specifically disclose suspending the
capturing of audio during periods where select ones of the captured visual features 
indicate that the speaker is not speaking. 

In the same field of audiovisual processing, VAN SCHYNDEL teaches 
suspending the capturing of audio during periods where select ones of the captured 
visual features indicate that the speaker is not speaking ("uses optical information to 
optimally select and/or steer a microphone array in the direction of the talker", VAN 
SCHYNDEL, column 2, lines 55-58, meaning audio is not captured for someone who is 
not speaking). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use selectable microphones of VAN SCHYNDEL 
with the speaker determination system of KATSUMI and audiovisual matching method 
of NEFIAN in order to not restrict a talker's movement or position (VAN SCHYNDEL, 
column 2, lines 60-61). 

32. Regarding claim 12, the combination of KATSUMI and NEFIAN teaches all the
limitations of claim 8.

However, KATSUMI and NEFIAN do not specifically disclose suspending the
capturing of audio when the analysis does not detect the mouths moving for the first and 
second speakers. 

In the same field of audiovisual processing, VAN SCHYNDEL teaches 
suspending the capturing of audio when the analysis does not detect the mouths 
moving for the first and second speakers ("uses optical information to optimally select
and/or steer a microphone array in the direction of the talker", VAN SCHYNDEL, column 
2, lines 55-58, meaning audio is not captured for someone who is not speaking). 

Therefore, it would have been obvious to a person of ordinary skill in the art at 
the time the invention was made to use selectable microphones of VAN SCHYNDEL 
with the speaker determination system of KATSUMI and audiovisual matching method 
of NEFIAN in order to not restrict a talker's movement or position (VAN SCHYNDEL, 
column 2, lines 60-61). 

Conclusion 

33. Applicant's amendment necessitated the new ground(s) of rejection presented in 
this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP 
§ 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 
CFR 1.136(a). 

A shortened statutory period for reply to this final action is set to expire THREE 
MONTHS from the mailing date of this action. In the event a first reply is filed within 
TWO MONTHS of the mailing date of this final action and the advisory action is not 
mailed until after the end of the THREE-MONTH shortened statutory period, then the 
shortened statutory period will expire on the date the advisory action is mailed, and any 
extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of 
the advisory action. In no event, however, will the statutory period for reply expire later 
than SIX MONTHS from the date of this final action. 
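
For illustration only, the reply-period rules stated in the paragraph above can be
sketched as a small date calculation. This sketch is not an official USPTO calculation:
the function names add_months and reply_period_end are hypothetical, the dates used are
invented, and the sketch assumes that a period ending on a day that does not exist in
the target month is clamped to that month's last day. It ignores weekends, holidays,
and extension fees.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Return the date `months` later, clamped to the last day of that month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def reply_period_end(
    mailing: date,
    first_reply: Optional[date] = None,
    advisory_mailed: Optional[date] = None,
) -> date:
    """End of the shortened statutory period under the rules quoted above."""
    two_months = add_months(mailing, 2)
    three_months = add_months(mailing, 3)
    six_months = add_months(mailing, 6)

    end = three_months
    # A first reply within two months, followed by an advisory action mailed
    # after the three-month period, moves the end to the advisory mailing date.
    if (
        first_reply is not None
        and advisory_mailed is not None
        and first_reply <= two_months
        and advisory_mailed > three_months
    ):
        end = advisory_mailed
    # In no event later than six months from the date of the final action.
    return min(end, six_months)


# Hypothetical dates, chosen only to exercise the rules.
print(reply_period_end(date(2007, 12, 10)))
print(reply_period_end(date(2007, 12, 10), date(2008, 2, 1), date(2008, 3, 20)))
```

The first call returns the plain three-month date; the second returns the advisory
action's mailing date because the first reply was filed within two months.
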

Any inquiry concerning this communication or earlier communications from the
examiner should be directed to Joel Stoffregen whose telephone number is (571) 270-1454.
The examiner can normally be reached on Monday - Friday, 9:00 a.m. - 6:30 p.m.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Patrick Edouard can be reached on (571) 272-7603. The fax phone number 
for the organization where this application or proceeding is assigned is 571-273-8300. 

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 
USPTO Customer Service Representative or access to the automated information 
system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




PATRICK N. EDOUARD
SUPERVISORY PATENT EXAMINER 



